Python has a csv reader/writer as part of its built in library. It is called csv. This is the simplest way to read tabular data (data in table format). The type of data you used to use excel to process (hopefully you will try out python now). It must be in text format to use the csv module, so .csv (comma separated) or .tsv (tab separated)
Here is the documentation: https://docs.python.org/2/library/csv.html
To use it, first you must import it
import csv
In [ ]:
import csv
Next we create a csv reader. You give it a file handle and optionally the dialect, the separator (usually commas or tabs), and the quote character.
with open(filename, 'r') as fh:
reader = csv.reader(fh, delimiter='\t', quotechar='"')
In [ ]:
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
The reader doesn't do anything yet. It is a generator that allows you to loop through the data (it is very similar to a file handle).
To loop through the data you just write a simple for loop
for row in reader:
#process row
The each row will be a list with each element corresponding to a single column.
In [ ]:
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
for row in reader:
print(row)
In [ ]:
Doesn't that look nice?
Well there are a few problems that I can see. First the header, how do we deal with that?
The easiest way I have found is to use the next method (that is available with any generator) before the for loop and to store that in a header variable. That reads the first line and stores it (so that you can use it later) and then advances the pointer to the next line so when you run the for loop it is only on the data.
header = reader.next()
for row in reader:
#process data
In [ ]:
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
header = next(reader)
for row in reader:
print(row)
print("Header", header)
In [ ]:
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
header = next(reader)
for row in reader:
float_row = [float(row[0]), float(row[1])]
print(float_row)
In [ ]:
In [ ]:
# Let's find the average distance for all walks.
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
header = next(reader)
# Empty list for storing all distances
walks = []
for row in reader:
#distance is in the first column
dist = row[0]
# Convert to float so we can do math
dist = float(dist)
# Append to our list
walks.append(dist)
# Use list aggregation methods to get average distance
ave_dist = sum(walks) / len(walks)
print("Average distance walked: {0:.1f}".format(ave_dist))
In [ ]:
# Let's see our pace for each walk
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
header = next(reader)
for row in reader:
#distance is in the first column
dist = row[0]
# Convert to float so we can do math
dist = float(dist)
#time in minutes is in the second column
time_minutes = row[1]
# Convert to float so we can do math
time_minutes = float(time_minutes)
# calculate pace as minutes / kilometer
pace = time_minutes /dist
print("Pace: {0:.1f} min/km".format(pace))
# If you want a challenge, try to make this seconds/mile
In [ ]:
# We can filter data. Let's get the ave pace only for walks longer than
# 3 km
# Let's see our pace for each walk
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
header = next(reader)
paces = []
for row in reader:
#distance is in the first column
dist = row[0]
# Convert to float so we can do math
dist = float(dist)
# Don't count short walks
if dist >= 3.0:
#time in minutes is in the second column
time_minutes = row[1]
# Convert to float so we can do math
time_minutes = float(time_minutes)
pace = time_minutes /dist
paces.append(pace)
ave_pace = sum(paces) / len(paces)
print("Average walking pace: {0:.1f} min/km".format(ave_pace))
Here is something I do all the time. It is a little more complicated than the above examples, so take your time trying to understand it. What I like to do is to read the csv data and transform it to a dictionary of lists. This allows me to use it in many different ways later in the code. It is most useful with larger dataset that I will be analyzing and using many different times. (You can even print it out as JSON!)
In [ ]:
# Lets see our pace for each walk
with open('walks.csv', 'r') as fh:
reader = csv.reader(fh, delimiter=',')
header = next(reader)
# This is the dictionary we will put our data from the csv into
# The key's are the column headers and the values is a list of
# all the data in that column (transformed into floats)
data = {}
# Initialize our dictionary with keys from header and values
# as empty lists
for column in header:
data[column] = []
for row in reader:
# Enumerate give us the index and the value so
# we don't have to use a count variable
for index, column in enumerate(header):
# convert data point to float
data_point = float(row[index])
# append data to dictionary's list for that column
data[column].append(data_point)
# look at that beautiful data. You can do anything with that!
print(data)
In [ ]:
The csv module also contains code for writing csvs.
To write, you create a writer using the writer method and give it a filehandle and optionally delimiter and quotechar.
with open('my_file.csv', 'w') as fh:
writer = csv.writer(fh, delimiter=',', quotechar='"')
Then use the writerow method with a list to write as it's argument.
writer.writerow([item1, item2])
In [ ]:
import random
with open('sleep.csv', 'w') as fh:
writer = csv.writer(fh, delimiter='\t', quotechar='"')
header = ['day', 'sleep (hr)']
writer.writerow(header)
for i in range(1,11):
hr_sleep = random.randint(4,10)
writer.writerow([i, hr_sleep])
#open the file to prove you wrote it. (Open in excel for best results)
In [ ]:
The file fiveK.csv contains data from the top 100 female finishers for the Firecracer 5k held in Reston on the 4th of July 2016 http://www.prraces.com/firecracker/
results.csv
In [ ]: